-
-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for sitemap index #65
Conversation
I personally like your choice of requiring including the One idea is that if the sitemap ends up being 50K+ links, then at the end of the run, if they ran it without |
crawler.py
Outdated
# index: zero-based index from which to start writing url strings contained in | ||
# self.url_strings_to_output | ||
try: | ||
sitemap_file = open(filename, 'w') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonder if it would be good to use a context manager here, like with open(filename, 'w') as sitemap_file:
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, definitely; changed.
I'm quite busy right now. I will also take a look at it this week. Sorry for the delay. |
869b215
to
d0ff61c
Compare
|
||
num_sitemap_files = math.ceil(len(self.url_strings_to_output) / self.MAX_URLS_PER_SITEMAP) | ||
sitemap_filenames = [] | ||
for i in range(0, num_sitemap_files): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably doesn't matter, but I notice that in the Google documentation, they 1-index the sitemaps whereas they're 0-indexed there. Curious to hear your thoughts there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its really matter ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm pretty certain it doesn't matter, so either way is fine I reckon.
Seems great for me. @Garrett-R Did you also confirm ? |
Yup, looks good to me, and I've used this branch already. |
As first mentioned in this issue, sitemaps over 50,000 URLs should be split into multiple sitemap files, with a single master index file pointing to all of the sitemap files. This PR adds support for outputting a sitemap index and multiple sitemap files.
Right now,
--output
is not a required parameter; but outputting an index and multiple sitemap files without writing them to files isn't quite sensible. The index contains pointers to files, so what would be the contents of the index when no output file is specified? Because of this, I'm currently requiring--output
when using the new--as-index
flag.In order to output an index, you have to include the
--as-index
flag. If you don't include the--as-index
flag, then the sitemap will be written to a single file, even if there are more than 50,000 URLs. My thinking was this would maintain backward compatibility; presumably everyone using the library right now is happy with the output so why change it? Another possibility would be to take this away as an option and just always write an index if there are more than 50,000 URLs, since this would be in line with the specification. If we do go with this, we would likely need to make --output required due to the first bullet described above.